Submitted by Group 17:
from CS 132 WFU
We first import the necessary modules.
!pip install chart_studio
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.io as pio
import plotly.offline as po
import plotly.express as px
import plotly.graph_objects as go
import chart_studio
pio.renderers.default = 'notebook'
po.init_notebook_mode()
The researchers collected data in two phases.
As will be further explained in a later section, the researchers decided to sample only 126 additional tweets in order to narrow the scope to the years 2020-2022. This cut-off also excludes 24 tweets from the Phase 1 dataset.
To properly integrate the two datasets, we conduct the data preprocessing and data exploration in separate notebooks in order to have a clearer separation between the research phases.
We import the original dataset and store it in a dataframe named original_dataset. We clone the dataset into a dataframe called original_tweets to be able to manipulate the data non-destructively.
url = "https://raw.githubusercontent.com/jolyuh/cs132-grp17-portfolio/main/assets/dataset/Combined%20Dataset%20-%20Group%2017.xlsx"
original_dataset = pd.read_excel(url)
original_tweets = original_dataset.copy()
original_tweets.shape
(150, 37)
We import the additional/supplemental dataset and store it in a dataframe called addtl_dataset. Just like with the original, we clone the dataset into a dataframe called addtl_tweets to be able to manipulate the data non-destructively.
url_2 = "https://raw.githubusercontent.com/jolyuh/cs132-grp17-portfolio/main/assets/dataset/Additional%20Tweets%20(Sample)%20-%20Group%2017.xlsx"
addtl_dataset = pd.read_excel(url_2)
addtl_tweets = addtl_dataset.copy()
addtl_tweets.shape
(126, 37)
The additional data was collected so that statistical tests could be conducted on the collected tweets. Since the data collected in Phase 1 consisted only of red-tagging tweets, while the research hypothesis involved categorizing posters as Marcos supporters, Duterte supporters, or non-supporters, having only red-tagging tweets made it impossible to conduct the tests.
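To illustrate the problem on a hypothetical miniature of the Phase 1 data (the names and values below are made up for the sketch): when every tweet is red-tagging, the contingency table between stance and red-tagging collapses to a single column, leaving no variation to test.

```python
import pandas as pd

# Hypothetical Phase 1-style data: every collected tweet is red-tagging
df = pd.DataFrame({
    "stance": ["Marcos", "Duterte", "Neither", "Marcos"],
    "red_tagging": [True, True, True, True],
})

# The contingency table has only one column, so a test of
# independence between stance and red-tagging cannot be run
table = pd.crosstab(df["stance"], df["red_tagging"])
```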
The data collected in Phase 2 consists of both red-tagging and non-red-tagging tweets manually tagged by the researchers, and so two columns were added:

- Red-tagging: True or False
- Tone: "Positive", "Neutral", or "Negative"

The Red-tagging column separates red-tagging from non-red-tagging tweets. The researchers also felt the need to manually identify the Tone of each tweet, because not all non-red-tagging tweets are positive in nature (e.g. "Anakbayan is trash!" isn't red-tagging, but still carries an overall negative sentiment towards the organization).
To combine the datasets, we first add these two columns to the original dataset. They are easy to populate, since all of the tweets in the original dataset are red-tagging and negative in tone.
# Add Red-tagging and Tone columns to original dataset
original_tweets['Red-tagging'] = True
original_tweets['Tone'] = "Negative"
original_tweets.head(5)
| un | Timestamp | Tweet URL | Group | Collector | Category | Topic | Keywords | Account handle | Account name | ... | Rating | Reasoning | Remarks | Marcos supporter | Duterte supporter | Explanation for the political stance | Red-tagging | Tone | Reviewer | Review | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17-1 | 2023-03-22 15:36:55 | https://twitter.com/DelicadoJuanito/status/130... | 17 | Alfonso, Francis Donald | REDT | AnakBayan is a terrorist organization | Anakbayan, NPA | @DelicadoJuanito | Juanito Delicado | ... | UNPROVEN | Twelve months after the murders, no suspect ha... | No location.\n\nThis tweet is a reply on a pos... | False | True | Unknown stance on Marcos\n\nTweet about Dutert... | True | Negative | NaN | NaN |
| 1 | 17-2 | 2023-03-26 20:25:45 | https://twitter.com/MDSOnwardPH22/status/12975... | 17 | Alfonso, Francis Donald | REDT | AnakBayan is a terrorist organization | Anakbayan, NPA | @MDSOnwardPH22 | ❤💞💞🇵🇭🇵🇭🙏🙏MARCOS-DUTERTE👊🏽👊🏽👊🏽 | ... | False | Labelling Anakbayan as the legal NPA front org... | The video is alluding that Anakbayan is part o... | True | True | Account name | True | Negative | NaN | NaN |
| 2 | 17-3 | 2023-03-26 22:34:19 | https://twitter.com/ysaysantos24/status/127895... | 17 | Alfonso, Francis Donald | REDT | AnakBayan is a terrorist organization | Anakbayan, NPA | @ysaysantos24 | Relissa Lucena | ... | False | Labelling Anakbayan as the legal NPA front org... | No location.\n\nStart of a thread | False | True | Unknown stance on Marcos \n\nLikes tweets supp... | True | Negative | NaN | NaN |
| 3 | 17-4 | 2023-03-30 02:44:36.674000 | https://twitter.com/Dbigbalbowski/status/97052... | 17 | Dizon, Julia Francesca | REDT | AnakBayan is a terrorist organization | Anakbayan, terorista | @Dbigbalbowski | Mr James🇵🇭 | ... | UNPROVEN | It has not been proven that Myles Albasin is p... | No bio.\n\nThe tweet contains a collage of pic... | True | True | Liked tweets supporting Duterte and Marcos. | True | Negative | NaN | NaN |
| 4 | 17-5 | 2023-03-30 02:49:23.538000 | https://twitter.com/Earth751/status/9700304405... | 17 | Dizon, Julia Francesca | REDT | AnakBayan is a terrorist organization | Anakbayan, terorista, NPA | @Earth751 | Earth@75 | ... | NaN | Labelling Anakbayan as the legal NPA front org... | Account was last active in 2018. | False | True | Tweets and likes content in support of or rela... | True | Negative | NaN | NaN |
5 rows × 37 columns
We then concatenate the two datasets into a dataframe called final_dataset and clone that into a dataframe named tweets for the data preprocessing proper. Note that the combined dataframe has 38 columns rather than 37, because the two datasets label their identifier column differently (un in the original, ID in the additional); the extra column is dropped shortly.
final_dataset = pd.concat([original_tweets, addtl_tweets], ignore_index=True)
tweets = final_dataset.copy()
tweets.shape
(276, 38)
Since the dataset contains columns that are not needed for the data exploration, we drop those columns, namely:

- ID
- Timestamp
- Group
- Collector
- Category
- Topic
- Keywords
- Reviewer
- Review
- Screenshot
- Views
- Rating
- Reasoning
- Remarks

These columns exist to help the researchers distinguish the samples on a meta level and are not necessary for the analysis.
tweets.columns
Index(['un', 'Timestamp', 'Tweet URL', 'Group', 'Collector', 'Category',
'Topic', 'Keywords', 'Account handle', 'Account name', 'Account bio',
'Account type', 'Joined', 'Following', 'Followers', 'Location', 'Tweet',
'Tweet Translated', 'Tweet Type', 'Date posted', 'Screenshot',
'Content type', 'Likes', 'Replies', 'Retweets', 'Quote Tweets', 'Views',
'Rating', 'Reasoning', 'Remarks', 'Marcos supporter',
'Duterte supporter', 'Explanation for the political stance',
'Red-tagging', 'Tone', 'Reviewer', 'Review', 'ID'],
dtype='object')
tweets = tweets.drop(columns=[
    'ID',
    'Timestamp',
    'Group',
    'Collector',
    'Category',
    'Topic',
    'Keywords',
    'Reviewer',
    'Review',
    'Screenshot',
    'Views',
    'Rating',
    'Reasoning',
    'Remarks'
])
For ease of coding, we rename each column to snake_case format.
column_names = tweets.columns.tolist()
column_names
['un', 'Tweet URL', 'Account handle', 'Account name', 'Account bio', 'Account type', 'Joined', 'Following', 'Followers', 'Location', 'Tweet', 'Tweet Translated', 'Tweet Type', 'Date posted', 'Content type', 'Likes', 'Replies', 'Retweets', 'Quote Tweets', 'Marcos supporter', 'Duterte supporter', 'Explanation for the political stance', 'Red-tagging', 'Tone']
def snake_caseify_column(name):
if name == 'Explanation for the political stance':
return 'stance_explanation'
elif name == 'Red-tagging':
return 'red_tagging'
return '_'.join(name.lower().split())
new_col_names = list(map(snake_caseify_column, column_names))
new_col_names
['un', 'tweet_url', 'account_handle', 'account_name', 'account_bio', 'account_type', 'joined', 'following', 'followers', 'location', 'tweet', 'tweet_translated', 'tweet_type', 'date_posted', 'content_type', 'likes', 'replies', 'retweets', 'quote_tweets', 'marcos_supporter', 'duterte_supporter', 'stance_explanation', 'red_tagging', 'tone']
tweets.columns = new_col_names
tweets.head(5)
| un | tweet_url | account_handle | account_name | account_bio | account_type | joined | following | followers | location | ... | content_type | likes | replies | retweets | quote_tweets | marcos_supporter | duterte_supporter | stance_explanation | red_tagging | tone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17-1 | https://twitter.com/DelicadoJuanito/status/130... | @DelicadoJuanito | Juanito Delicado | KKK Philippines member last generation Miriam ... | Anonymous | 2020-08-01 | 38.0 | 3 | NaN | ... | Emotional | 0.0 | 0.0 | 0.0 | 0.0 | False | True | Unknown stance on Marcos\n\nTweet about Dutert... | True | Negative |
| 1 | 17-2 | https://twitter.com/MDSOnwardPH22/status/12975... | @MDSOnwardPH22 | ❤💞💞🇵🇭🇵🇭🙏🙏MARCOS-DUTERTE👊🏽👊🏽👊🏽 | When helping the poor,leave the camera at home! | Anonymous | 2018-05-01 | 7982.0 | 12431 | Manila | ... | Emotional | 37.0 | 2.0 | 14.0 | 4.0 | True | True | Account name | True | Negative |
| 2 | 17-3 | https://twitter.com/ysaysantos24/status/127895... | @ysaysantos24 | Relissa Lucena | YaKaP ng Magulang\nAj is my life, maging makat... | Identified | 2020-01-01 | 183.0 | 953 | NaN | ... | Emotional | 84.0 | 4.0 | 28.0 | 3.0 | False | True | Unknown stance on Marcos \n\nLikes tweets supp... | True | Negative |
| 3 | 17-4 | https://twitter.com/Dbigbalbowski/status/97052... | @Dbigbalbowski | Mr James🇵🇭 | NaN | Anonymous | 2011-05-01 | 1676.0 | 8352 | New York, USA | ... | Emotional | 0.0 | 0.0 | 0.0 | 0.0 | True | True | Liked tweets supporting Duterte and Marcos. | True | Negative |
| 4 | 17-5 | https://twitter.com/Earth751/status/9700304405... | @Earth751 | Earth@75 | ako y simpling tao maka diyos.makatao at makab... | Anonymous | 2017-03-01 | 9.0 | 2 | KSA KHOBAR | ... | Emotional | 0.0 | 0.0 | 0.0 | 0.0 | False | True | Tweets and likes content in support of or rela... | True | Negative |
5 rows × 24 columns
Below is a summary of the dataframe information. It shows at a glance the number of non-null entries in each column. Since we have 276 samples, any column with fewer than 276 non-null values has holes in our data. For our data preprocessing, then, we investigate these holes.
tweets.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 276 entries, 0 to 275 Data columns (total 24 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 un 150 non-null object 1 tweet_url 276 non-null object 2 account_handle 276 non-null object 3 account_name 275 non-null object 4 account_bio 233 non-null object 5 account_type 276 non-null object 6 joined 276 non-null datetime64[ns] 7 following 150 non-null float64 8 followers 276 non-null int64 9 location 159 non-null object 10 tweet 276 non-null object 11 tweet_translated 195 non-null object 12 tweet_type 276 non-null object 13 date_posted 276 non-null object 14 content_type 275 non-null object 15 likes 275 non-null float64 16 replies 275 non-null float64 17 retweets 275 non-null float64 18 quote_tweets 275 non-null float64 19 marcos_supporter 276 non-null bool 20 duterte_supporter 276 non-null bool 21 stance_explanation 275 non-null object 22 red_tagging 276 non-null bool 23 tone 275 non-null object dtypes: bool(3), datetime64[ns](1), float64(5), int64(1), object(14) memory usage: 46.2+ KB
# This summarizes the columns that do have null values.
for i in tweets.columns[tweets.isna().any()].tolist():
print(i)
un account_name account_bio following location tweet_translated content_type likes replies retweets quote_tweets stance_explanation tone
However, while we should fill in holes where we can, not all values are required or available. For example, a Twitter account is not required to have a Location or an Account bio, which is why those columns have more null values than the others. The Views column (dropped above) was also largely null because the Views feature of tweets only started rolling out in late December 2022, which covers only a small part of the date range of our data. Thus, for our data clean-up, we only attend to null values that represent an actual lack of information needed for our research. The relevant columns are:

- Account name
- Content type
- Likes
- Replies
- Retweets
- Quote Tweets

To correct fields about tweet or Twitter account info, we replace the NaN values with what is currently shown on Twitter. Fields that require our own assessment are filled out with our own input.
tweets[tweets['account_name'].isna()]
| un | tweet_url | account_handle | account_name | account_bio | account_type | joined | following | followers | location | ... | content_type | likes | replies | retweets | quote_tweets | marcos_supporter | duterte_supporter | stance_explanation | red_tagging | tone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 37 | 17-38 | https://twitter.com/crux_sonata/status/1270296... | @crux_sonata | NaN | mahalin natin aNg pilipinas | Anonymous | 2016-11-01 | 272.0 | 1416 | Philippines | ... | Emotional | 1.0 | 0.0 | 0.0 | 0.0 | False | True | Has posts supporting Duterte | True | Negative |
1 rows × 24 columns
tweets.at[37, 'account_name'] = "Crux of the Matter 🕊🏃‍♀️🦅🏃‍♀️"
tweets[tweets['content_type'].isna()]
| un | tweet_url | account_handle | account_name | account_bio | account_type | joined | following | followers | location | ... | content_type | likes | replies | retweets | quote_tweets | marcos_supporter | duterte_supporter | stance_explanation | red_tagging | tone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27 | 17-28 | https://twitter.com/dyslexiczs/status/94222907... | @dyslexiczs | ᜉᜐᜅ͓ ᜄᜎ | I am dyslexic and my opinion matters\n | Anonymous | 2016-10-01 | 279.0 | 79 | NaN | ... | NaN | NaN | NaN | NaN | NaN | False | True | Supports Duterte [1]\n\nDoes not support Marco... | True | Negative |
1 rows × 24 columns
tweets.at[27, 'content_type'] = "Emotional"
tweets[tweets['likes'].isna()]
| un | tweet_url | account_handle | account_name | account_bio | account_type | joined | following | followers | location | ... | content_type | likes | replies | retweets | quote_tweets | marcos_supporter | duterte_supporter | stance_explanation | red_tagging | tone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27 | 17-28 | https://twitter.com/dyslexiczs/status/94222907... | @dyslexiczs | ᜉᜐᜅ͓ ᜄᜎ | I am dyslexic and my opinion matters\n | Anonymous | 2016-10-01 | 279.0 | 79 | NaN | ... | Emotional | NaN | NaN | NaN | NaN | False | True | Supports Duterte [1]\n\nDoes not support Marco... | True | Negative |
1 rows × 24 columns
tweets[tweets['replies'].isna()]
| un | tweet_url | account_handle | account_name | account_bio | account_type | joined | following | followers | location | ... | content_type | likes | replies | retweets | quote_tweets | marcos_supporter | duterte_supporter | stance_explanation | red_tagging | tone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27 | 17-28 | https://twitter.com/dyslexiczs/status/94222907... | @dyslexiczs | ᜉᜐᜅ͓ ᜄᜎ | I am dyslexic and my opinion matters\n | Anonymous | 2016-10-01 | 279.0 | 79 | NaN | ... | Emotional | NaN | NaN | NaN | NaN | False | True | Supports Duterte [1]\n\nDoes not support Marco... | True | Negative |
1 rows × 24 columns
tweets[tweets['retweets'].isna()]
| un | tweet_url | account_handle | account_name | account_bio | account_type | joined | following | followers | location | ... | content_type | likes | replies | retweets | quote_tweets | marcos_supporter | duterte_supporter | stance_explanation | red_tagging | tone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27 | 17-28 | https://twitter.com/dyslexiczs/status/94222907... | @dyslexiczs | ᜉᜐᜅ͓ ᜄᜎ | I am dyslexic and my opinion matters\n | Anonymous | 2016-10-01 | 279.0 | 79 | NaN | ... | Emotional | NaN | NaN | NaN | NaN | False | True | Supports Duterte [1]\n\nDoes not support Marco... | True | Negative |
1 rows × 24 columns
tweets[tweets['quote_tweets'].isna()]
| un | tweet_url | account_handle | account_name | account_bio | account_type | joined | following | followers | location | ... | content_type | likes | replies | retweets | quote_tweets | marcos_supporter | duterte_supporter | stance_explanation | red_tagging | tone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27 | 17-28 | https://twitter.com/dyslexiczs/status/94222907... | @dyslexiczs | ᜉᜐᜅ͓ ᜄᜎ | I am dyslexic and my opinion matters\n | Anonymous | 2016-10-01 | 279.0 | 79 | NaN | ... | Emotional | NaN | NaN | NaN | NaN | False | True | Supports Duterte [1]\n\nDoes not support Marco... | True | Negative |
1 rows × 24 columns
# The sample with NaN value (tweet #27) is the same across these four characteristics:
# All these values are 0 for this tweet
tweets.at[27, 'likes'] = 0
tweets.at[27, 'replies'] = 0
tweets.at[27, 'retweets'] = 0
tweets.at[27, 'quote_tweets'] = 0
tweets[tweets['stance_explanation'].isna()]
| un | tweet_url | account_handle | account_name | account_bio | account_type | joined | following | followers | location | ... | content_type | likes | replies | retweets | quote_tweets | marcos_supporter | duterte_supporter | stance_explanation | red_tagging | tone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25 | 17-26 | https://twitter.com/merilla2010/status/9398245... | @merilla2010 | Marski 👊👊👊 | Simple Man | Anonymous | 2010-08-01 | 612.0 | 1035 | Saudi Arabia | ... | Emotional | 4.0 | 1.0 | 2.0 | 0.0 | False | False | NaN | True | Negative |
1 rows × 24 columns
Since the missing explanation means the researchers neglected to assess this tweeter's political stance, we fill in the values here.
tweets.at[25, 'marcos_supporter'] = False
tweets.at[25, 'duterte_supporter'] = True
tweets.at[25, 'stance_explanation'] = "Display name has fist emojis commonly associated with Duterte. Not enough data to show support for Marcos."
Date posted column

During the data visualization in a subsequent section, it was found that not all values in the Date posted column were read by pandas as Python datetime objects: some were in DD/MM/YY HH:MM format instead of YYYY-MM-DD HH:MM:SS format, and so were read as str objects instead.
from datetime import datetime
string_dates = tweets[tweets['date_posted'].apply(lambda x: isinstance(x, str))]
datetime_dates = tweets[tweets['date_posted'].apply(lambda x: isinstance(x, datetime))]
print(f"Dates in str format: {string_dates.shape[0]}")
print(f"Dates in datetime format: {datetime_dates.shape[0]}")
Dates in str format: 30 Dates in datetime format: 246
string_dates['date_posted'].head(5)
44 14/05/22 10:31 46 15/04/21 08:51 47 27/01/21 15:34 48 29/10/20 10:45 50 24/02/22 10:56 Name: date_posted, dtype: object
datetime_dates['date_posted'].head(5)
0 2020-08-30 19:30:00 1 2020-08-23 20:12:00 2 2020-07-03 15:26:00 3 2018-03-05 13:16:40 4 2018-03-04 04:17:33 Name: date_posted, dtype: object
To fix this, we replace the original Date posted column with a modified version that creates a datetime object based on the value from the DD/MM/YY HH:MM formatted string.
def get_date_slice(date):
    # "14/05/22" -> [14, 5, 22]
    return [int(part) for part in date.split('/')]
def get_time_slice(time):
    # "10:31" -> [10, 31]
    return [int(part) for part in time.split(':')]
def get_datetime_from_str(date_str):
    # Values already parsed as datetime are returned unchanged
    if isinstance(date_str, datetime):
        return date_str
    date_part, time_part = date_str.split(' ')
    day, month, year = get_date_slice(date_part)
    hour, minute = get_time_slice(time_part)
    return datetime(2000 + year, month, day, hour, minute)
tweets['date_posted'] = tweets['date_posted'].map(get_datetime_from_str)
tweets['date_posted']
0 2020-08-30 19:30:00
1 2020-08-23 20:12:00
2 2020-07-03 15:26:00
3 2018-03-05 13:16:40
4 2018-03-04 04:17:33
...
271 2020-03-28 17:23:17
272 2020-08-24 22:05:42
273 2021-08-28 19:54:31
274 2021-12-25 10:50:33
275 2022-12-29 18:45:12
Name: date_posted, Length: 276, dtype: datetime64[ns]
As a result of our processing, all of the values in the Date posted column are now datetime objects.
tweets['date_posted'].apply(lambda x: isinstance(x, datetime)).describe()
count 276 unique 1 top True freq 276 Name: date_posted, dtype: object
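As an aside, the same conversion can be done in a vectorized way with pd.to_datetime. Below is a minimal sketch on hypothetical values mixing the two formats observed in the dataset: the first pass parses the DD/MM/YY HH:MM entries and the second pass fills in the rest.

```python
import pandas as pd

# Hypothetical values in the two formats observed in the dataset
mixed = pd.Series(["14/05/22 10:31", "2020-08-30 19:30:00"])

# First pass: parse the DD/MM/YY HH:MM entries; non-matching ones become NaT
parsed = pd.to_datetime(mixed, format="%d/%m/%y %H:%M", errors="coerce")

# Second pass: fill the NaT slots by parsing the remaining (ISO-style) entries
parsed = parsed.fillna(pd.to_datetime(mixed, format="%Y-%m-%d %H:%M:%S", errors="coerce"))
```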
With the Date posted column values all properly converted to datetime objects, we can now remove tweets posted before 2020 from the dataset. Since rows were removed, we also reset the indices of the dataframe to avoid confusion.
tweets = tweets[tweets['date_posted'].dt.year >= 2020]
tweets.reset_index(drop=True, inplace=True)
tweets
| un | tweet_url | account_handle | account_name | account_bio | account_type | joined | following | followers | location | ... | content_type | likes | replies | retweets | quote_tweets | marcos_supporter | duterte_supporter | stance_explanation | red_tagging | tone | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17-1 | https://twitter.com/DelicadoJuanito/status/130... | @DelicadoJuanito | Juanito Delicado | KKK Philippines member last generation Miriam ... | Anonymous | 2020-08-01 00:00:00 | 38.0 | 3 | NaN | ... | Emotional | 0.0 | 0.0 | 0.0 | 0.0 | False | True | Unknown stance on Marcos\n\nTweet about Dutert... | True | Negative |
| 1 | 17-2 | https://twitter.com/MDSOnwardPH22/status/12975... | @MDSOnwardPH22 | ❤💞💞🇵🇭🇵🇭🙏🙏MARCOS-DUTERTE👊🏽👊🏽👊🏽 | When helping the poor,leave the camera at home! | Anonymous | 2018-05-01 00:00:00 | 7982.0 | 12431 | Manila | ... | Emotional | 37.0 | 2.0 | 14.0 | 4.0 | True | True | Account name | True | Negative |
| 2 | 17-3 | https://twitter.com/ysaysantos24/status/127895... | @ysaysantos24 | Relissa Lucena | YaKaP ng Magulang\nAj is my life, maging makat... | Identified | 2020-01-01 00:00:00 | 183.0 | 953 | NaN | ... | Emotional | 84.0 | 4.0 | 28.0 | 3.0 | False | True | Unknown stance on Marcos \n\nLikes tweets supp... | True | Negative |
| 3 | 17-12 | https://twitter.com/BasarteDiaz/status/1294961... | @BasarteDiaz | Senior Herudes | Electrical Engineer\nProud DDS\nDuterte Delici... | Identified | 2020-03-01 00:00:00 | 195.0 | 45 | NaN | ... | Emotional | 0.0 | 0.0 | 0.0 | 0.0 | False | True | From the bio | True | Negative |
| 4 | 17-15 | https://twitter.com/RightWingPinoy/status/1287... | @RightWingPinoy | A Machiavelli | fascist • right wing • business | Anonymous | 2019-03-01 00:00:00 | 180.0 | 68 | Calamba City, Calabarzon | ... | Rational | 0.0 | 0.0 | 0.0 | 0.0 | True | True | Retweeted a thread promoting Marcos and Duterte | True | Negative |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 247 | NaN | https://twitter.com/icekemberloo/status/124383... | icekemberloo | yuki | 🐕🐶🍓🥝🧀🍖🍗🥓🍱🍙🍳🥚🥙🌯🍜🍩🍫🍬 | Anonymous | 2020-03-25 23:21:15 | NaN | 13 | National Capital Region, Repub | ... | Rational, Emotional | 1.0 | 0.0 | 0.0 | 0.0 | True | True | Has multiple tweets supporting Duterte-Marcos | False | Negative |
| 248 | NaN | https://twitter.com/PilipinasScm/status/129789... | PilipinasScm | Student Christian Movement of the PH #SCMPat62 | Est 27 Dec 1960. Follow Christ, (Rom 3:22) Lov... | Media | 2019-04-08 19:22:59 | NaN | 4361 | NaN | ... | Rational | 0.0 | 0.0 | 0.0 | 0.0 | False | False | Media account | False | Neutral |
| 249 | NaN | https://twitter.com/anakbayanMBHS/status/14315... | anakbayanMBHS | Anakbayan Metro-Baguio High School | Isang komprehensibong pambansa demokratikong p... | Media | 2020-04-27 21:29:57 | NaN | 348 | Baguio City | ... | Rational, Emotional | 10.0 | 1.0 | 5.0 | 0.0 | False | False | Anakbayan account | False | Neutral |
| 250 | NaN | https://twitter.com/anakbayan_ne/status/147457... | anakbayan_ne | Anakbayan Nueva Ecija | Lumalakas, lumalawak, lumalaban! #SumaliSaAnak... | Media | 2020-06-14 22:11:58 | NaN | 889 | Nueva Ecija | ... | Emotional | 5.0 | 0.0 | 8.0 | 0.0 | False | False | Anakbayan account | False | Neutral |
| 251 | NaN | https://twitter.com/BoniArtKolektib/status/160... | BoniArtKolektib | Bonifacio Artists Collective | Bonifacio Artists Collective is a cultural org... | Media | 2021-11-18 17:26:44 | NaN | 119 | Taguig City | ... | Rational | 0.0 | 0.0 | 0.0 | 0.0 | False | False | https://twitter.com/BoniArtKolektib/status/157... | False | Neutral |
252 rows × 24 columns
The only columns that require encoding are Marcos supporter and Duterte supporter.
tweets['marcos_supporter'] = tweets['marcos_supporter'].replace({True: 1, False: 0})
tweets['marcos_supporter']
0 0
1 1
2 0
3 0
4 1
..
247 1
248 0
249 0
250 0
251 0
Name: marcos_supporter, Length: 252, dtype: int64
tweets['duterte_supporter'] = tweets['duterte_supporter'].replace({True: 1, False: 0})
tweets['duterte_supporter']
0 1
1 1
2 1
3 1
4 1
..
247 1
248 0
249 0
250 0
251 0
Name: duterte_supporter, Length: 252, dtype: int64
One of the assumptions of the $\chi^2$ test is that each observation must be a distinct choice between classes. Our dataset does not yet fit this: we aim to test the columns marcos_supporter and duterte_supporter, but both columns are checked for a supporter of both, and neither is checked for a supporter of neither.
Thus, we create four new columns following the one-hot encoding scheme:
- supports_marcos_only
- supports_duterte_only
- supports_both
- supports_neither

tweets['supports_marcos_only'] = tweets['marcos_supporter'] & ~tweets['duterte_supporter']
tweets['supports_duterte_only'] = ~tweets['marcos_supporter'] & tweets['duterte_supporter']
tweets['supports_both'] = tweets['marcos_supporter'] & tweets['duterte_supporter']
tweets['supports_neither'] = ((tweets['marcos_supporter'] + tweets['duterte_supporter']) == 0).astype(int)
# For data visualization:
def pol_stance(x):
    # Map the two support flags to a single categorical label
    if x['marcos_supporter'] and not x['duterte_supporter']:
        return "Marcos only"
    elif not x['marcos_supporter'] and x['duterte_supporter']:
        return "Duterte only"
    elif x['marcos_supporter'] and x['duterte_supporter']:
        return "Marcos-Duterte"
    return "Neither"
# pol_labels = ["Supports both", "Supports Duterte only", "Supports Marcos only", "Supports Neither"]
tweets['pol_stance'] = tweets.apply(pol_stance, axis=1)
tweets[['supports_marcos_only', 'supports_duterte_only', 'supports_both', 'supports_neither', 'pol_stance']]
| supports_marcos_only | supports_duterte_only | supports_both | supports_neither | pol_stance | |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | Duterte only |
| 1 | 0 | 0 | 1 | 0 | Marcos-Duterte |
| 2 | 0 | 1 | 0 | 0 | Duterte only |
| 3 | 0 | 1 | 0 | 0 | Duterte only |
| 4 | 0 | 0 | 1 | 0 | Marcos-Duterte |
| ... | ... | ... | ... | ... | ... |
| 247 | 0 | 0 | 1 | 0 | Marcos-Duterte |
| 248 | 0 | 0 | 0 | 1 | Neither |
| 249 | 0 | 0 | 0 | 1 | Neither |
| 250 | 0 | 0 | 0 | 1 | Neither |
| 251 | 0 | 0 | 0 | 1 | Neither |
252 rows × 5 columns
tweets[['supports_marcos_only', 'supports_duterte_only', 'supports_both', 'supports_neither']].info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 252 entries, 0 to 251 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 supports_marcos_only 252 non-null int64 1 supports_duterte_only 252 non-null int64 2 supports_both 252 non-null int64 3 supports_neither 252 non-null int64 dtypes: int64(4) memory usage: 8.0 KB
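With each tweet now assigned to exactly one stance class, the eventual test can tabulate stance against red-tagging. The sketch below uses hypothetical miniature data (the actual test belongs to a later section) and computes the $\chi^2$ statistic by hand with numpy rather than a stats library.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset: one row per tweet,
# a single stance class and a red-tagging flag
df = pd.DataFrame({
    "pol_stance":  ["Marcos-Duterte", "Duterte only", "Neither", "Neither",
                    "Marcos-Duterte", "Duterte only", "Neither", "Marcos-Duterte"],
    "red_tagging": [True, True, False, False, True, False, False, True],
})

# Contingency table: each observation falls in exactly one cell
observed = pd.crosstab(df["pol_stance"], df["red_tagging"]).to_numpy()

# Expected counts under independence: (row total * column total) / grand total
row_tot = observed.sum(axis=1, keepdims=True)
col_tot = observed.sum(axis=0, keepdims=True)
expected = row_tot @ col_tot / observed.sum()

# Chi-square statistic: sum of (O - E)^2 / E over all cells
chi2 = ((observed - expected) ** 2 / expected).sum()
```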
tweets[tweets['tweet_translated'].isna()].shape[0]
76
Before conducting natural language processing on the dataset, we must ensure that the tweets used are all in English. However, since not all tweets had to be translated, the dataset's tweet_translated column has some empty values.
In order to preserve the original values in the dataset, we create a new column final_tweet to collect the tweet contents that will be used in the data processing.
tweets['final_tweet'] = tweets['tweet_translated'].fillna(tweets['tweet'])
tweets['final_tweet']
0 You killed them, not the government. Don't you...
1 Poor youth. They are wasting their bright futu...
2 Why do we need to protect our children against...
3 They are the group that knows nothing but free...
4 That's nice. LFS, Anakbayan, Kabataan, Gabriel...
...
247 @CarmiLu68 so true! .. most of them are millen...
248 @SCMP_Tuguegarao @SCMPDavao @scmpmetrobaguio @...
249 We are the Anakbayan Metro-Baguio Highschool, ...
250 Along with our celebration of Christmas Day, w...
251 Temporary Anakbayan FB page: https://t.co/lW80...
Name: final_tweet, Length: 252, dtype: object
tweets[tweets['final_tweet'].isna()]
| un | tweet_url | account_handle | account_name | account_bio | account_type | joined | following | followers | location | ... | duterte_supporter | stance_explanation | red_tagging | tone | supports_marcos_only | supports_duterte_only | supports_both | supports_neither | pol_stance | final_tweet |
|---|
0 rows × 30 columns
With our dataset cleaned up, we can look at the distribution of values in the dataset.
tweets.describe().style
| following | followers | likes | replies | retweets | quote_tweets | marcos_supporter | duterte_supporter | supports_marcos_only | supports_duterte_only | supports_both | supports_neither | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 126.000000 | 252.000000 | 252.000000 | 252.000000 | 252.000000 | 252.000000 | 252.000000 | 252.000000 | 252.000000 | 252.000000 | 252.000000 | 252.000000 |
| mean | 725.126984 | 2711.575397 | 16.702381 | 1.067460 | 6.158730 | 0.670635 | 0.440476 | 0.503968 | 0.039683 | 0.103175 | 0.400794 | 0.456349 |
| std | 1329.188107 | 6638.969192 | 54.263511 | 4.209576 | 20.321093 | 3.805879 | 0.497432 | 0.500979 | 0.195601 | 0.304792 | 0.491035 | 0.499082 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 130.000000 | 168.500000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 302.000000 | 564.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 745.500000 | 1420.750000 | 8.000000 | 1.000000 | 2.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| max | 9381.000000 | 29298.000000 | 531.000000 | 46.000000 | 203.000000 | 41.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
# Check tone of tweets
print("Number of positive tweets: ", tweets[tweets['tone'] == 'Positive']['tone'].count())
print("Number of negative tweets: ", tweets[tweets['tone'] == 'Negative']['tone'].count())
print("Number of neutral tweets: ", tweets[tweets['tone'] == 'Neutral']['tone'].count())
Number of positive tweets:  15
Number of negative tweets:  144
Number of neutral tweets:  92
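The three counts above can also be read off in a single call with `value_counts`; a minimal sketch on a toy frame (the `tone` column name matches our dataset, but the rows here are made up):

```python
import pandas as pd

# Toy stand-in for the tweets frame; the real one has 252 rows
tweets_demo = pd.DataFrame({
    "tone": ["Negative", "Neutral", "Negative", "Positive", "Negative"]
})

# value_counts returns one count per distinct tone, sorted descending
tone_counts = tweets_demo["tone"].value_counts()
print(tone_counts)
```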
We then visualize a general overview of the tweets in our dataset based on whether each tweet's poster was classified as a Marcos or Duterte supporter.
marcos = tweets.query("marcos_supporter == 1").shape[0]
duterte = tweets.query("duterte_supporter == 1").shape[0]
marcos_duterte = tweets.query("marcos_supporter == 1 and duterte_supporter == 1").shape[0]
marcos_only = tweets.query("marcos_supporter == 1 and duterte_supporter == 0").shape[0]
duterte_only = tweets.query("marcos_supporter == 0 and duterte_supporter == 1").shape[0]
neither = tweets.query("marcos_supporter == 0 and duterte_supporter == 0").shape[0]
total = tweets.shape[0]
pie_data = np.array([marcos_duterte, marcos_only, duterte_only, neither])
pie_labels = [
"Marcos-Duterte",
"Marcos only",
"Duterte only",
"Neither"
]
interactive_pie = go.Pie(labels=pie_labels, values=pie_data)
fig = go.Figure(data=interactive_pie)
fig.update_layout(
title_text='Poster Political Leaning', # title of plot
width=625
)
fig.show(width=625)
# chart_studio.plotly.iplot(fig, filename = 'political-leaning', auto_open=True)
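The four `query` counts above could also be obtained in one call with `pd.crosstab`, which tabulates every combination of the two flags at once; a minimal sketch with toy 0/1 columns (not the real data):

```python
import pandas as pd

# Toy stand-in for the two one-hot support flags
demo = pd.DataFrame({
    "marcos_supporter":  [1, 1, 0, 0, 1],
    "duterte_supporter": [1, 0, 1, 0, 1],
})

# Rows: marcos_supporter (0/1); columns: duterte_supporter (0/1)
table = pd.crosstab(demo["marcos_supporter"], demo["duterte_supporter"])
print(table)
# table.loc[1, 1] is the "supports both" count, table.loc[0, 0] is "neither", etc.
```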
rational = tweets.query("content_type == \"Rational\"").shape[0]
emotional = tweets.query("content_type == \"Emotional\"").shape[0]
transactional = tweets.query("content_type == \"Transactional\"").shape[0]
content_type_data = np.array([rational, emotional, transactional])
content_type_labels = [
"Rational",
"Emotional",
"Transactional",
]
content_type_counts = pd.DataFrame({
    'Content Type': content_type_labels,
    'No. of tweets': content_type_data
})
fig = px.bar(content_type_counts, x="Content Type", y="No. of tweets", title="Content type of collected tweets")
fig.update_layout(
width=625
)
fig.show()
It can be seen that the majority of the collected tweets were classified as Emotional, and that many of them were also replies to other tweets.
emotional_tweets = tweets.query("content_type == 'Emotional'")
reply_count = emotional_tweets[emotional_tweets['tweet_type'].str.contains('Reply')].shape[0]
print(f"Number of Emotional tweets that are also replies: {reply_count}")
emotional_tweets[['tweet', 'content_type', 'tweet_type']]
Number of Emotional tweets that are also replies: 105
| tweet | content_type | tweet_type | |
|---|---|---|---|
| 0 | Kayo po pumatay d ang government. Huwag nyo k... | Emotional | Text, Reply |
| 1 | Kawawang kabataan,sinayang ang magandang kinab... | Emotional | Text, Video |
| 2 | Bakit namin kailangan protektahan ang aming mg... | Emotional | Text, Image |
| 3 | Sila yung grupo na walang alam kundi freedom o... | Emotional | Text, Image |
| 5 | May mga woke na taga US din. Nagco-comment ka... | Emotional | Text, Reply |
| ... | ... | ... | ... |
| 240 | Paano natin nasasabing "pasista" ang isang "pa... | Emotional | Text |
| 241 | hapee birthday anakbayan ne ako toh si anna ta... | Emotional | Text |
| 245 | Sumali sa Anakbayan PUP-COC: https://t.co/ezVZ... | Emotional | Text, Reply, Link |
| 246 | Sabay-sabay nating palawakin ang ating kaalama... | Emotional | Text, Reply |
| 250 | Kasabay ng ating pagsasalo-salo ngayong Araw n... | Emotional | Text, Image |
171 rows × 3 columns
def all_quarters():
    ret = []
    years = [str(x) for x in range(2020, 2023)]
    quarters = ["Q1", "Q2", "Q3", "Q4"]
    for year in years:
        for qtr in quarters:
            ret.append(year + qtr)
    return ret
quarter_posted = pd.PeriodIndex(tweets["date_posted"], freq='Q')
tweets['quarter_posted'] = quarter_posted
quarter_counts = list(map(lambda qtr: (tweets['quarter_posted']==qtr).sum(), all_quarters()))
quarter_counts
[10, 54, 18, 11, 15, 13, 12, 17, 33, 53, 5, 11]
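An equivalent, somewhat more idiomatic way to get the per-quarter counts is `value_counts` reindexed against a `period_range`, which also guarantees that quarters with no tweets show up as zeros; a sketch on a few toy dates:

```python
import pandas as pd

# Toy post dates; the real notebook uses the date_posted column
dates = pd.Series(pd.to_datetime(["2020-02-01", "2020-05-15", "2020-06-30", "2021-11-02"]))
quarters = pd.PeriodIndex(dates, freq="Q")

# Count tweets per quarter over 2020Q1-2022Q4, filling empty quarters with 0
all_quarters = pd.period_range("2020Q1", "2022Q4", freq="Q")
counts = pd.Series(quarters).value_counts().reindex(all_quarters, fill_value=0)
print(counts)
```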
# Heatmap of when the collected tweets were posted, by quarter
data = np.array([[quarter_counts[x] for x in range(i*4, i*4 + 4)] for i in range(0, 3)])
data = data.T
fig = px.imshow(data,
labels=dict(x="Year", y="Quarter", color="Tweets"),
x=[str(x) for x in range(2020, 2023)],
y=['Q1', 'Q2', 'Q3', 'Q4'],
title="Distribution of 'Date posted' for tweets by quarter"
)
fig.update_layout(
width=625
)
fig.show()
# chart_studio.plotly.iplot(fig, filename = 'date-posted', auto_open=True)
By visualizing the distribution of post dates, we hoped to gain additional insight into the context behind increases or decreases in the number of red-tagging tweets posted during certain periods.
Of the 252 tweets collected, the greatest numbers of red-tagging tweets were found in the following quarters: Q2 2020, Q2 2022, and Q1 2022.
Though the scope is limited to the 252 tweets collected by the researchers, which could have been affected by any biases introduced by Twitter's search algorithm, the surge in numbers coincides with two major events:
# creating a copy of tweets df to sort
pol_df = tweets[['red_tagging', 'pol_stance', 'tone']]
# pol_df.sort_values(['pol_stance'],ascending=[True],inplace=True)
fig = px.histogram(pol_df, x="red_tagging",
color='pol_stance', barmode='group',
histfunc='count', color_discrete_map={
'Marcos-Duterte' : '#ef543a',
'Marcos only' : '#ab63fa',
'Duterte only' : '#00cd97',
'Neither' : '#636ffb'
})
fig.update_layout(
title_text='Distribution of political stances of red-tagging vs non-red-tagging', # title of plot
xaxis_title_text='Is the tweet red-tagging?', # xaxis label
yaxis_title_text='Count', # yaxis label
bargap=0.2, # gap between bars of adjacent location coordinates
bargroupgap=0.1, # gap between bars of the same location coordinates
legend_title="Political Stance",
width=625
)
fig.show()
# chart_studio.plotly.iplot(fig, filename = 'red-vs-nonred', auto_open=True)
Visualizing the presence of red-tagging in a tweet against the political stance of the user who posted it shows that an overwhelming majority of the red-tagging tweets came from Marcos-Duterte, Duterte-only, and Marcos-only supporters: 125 of the 131 red-tagging tweets came from those groups.
The chart also shows that a majority of the non-red-tagging tweets came from those who support neither, with 109 of the 121 non-red-tagging tweets coming from that group.
fig = px.histogram(pol_df, x="tone",
color='pol_stance', barmode='group',
histfunc='count',
color_discrete_map={
'Marcos-Duterte' : '#ef543a',
'Marcos only' : '#ab63fa',
'Duterte only' : '#00cd97',
'Neither' : '#636ffb'
})
fig.update_layout(
title_text='Distribution of political stances of different tones of tweets', # title of plot
xaxis_title_text='Tone', # xaxis label
yaxis_title_text='Count', # yaxis label
bargap=0.2, # gap between bars of adjacent location coordinates
bargroupgap=0.1, # gap between bars of the same location coordinates
legend_title="Political Stance",
width=625
)
fig.show()
# chart_studio.plotly.iplot(fig, filename = 'tweet-tones', auto_open=True)
The tones of the tweets also strongly reflected political stance: no supporter of Marcos, Duterte, or both had anything positive to say about the organization Anakbayan. Only one Marcos-Duterte supporter tweeted something with a neutral sentiment; apart from that single tweet, every tweet from Marcos-only, Duterte-only, and Marcos-Duterte supporters had a negative tone.
While around eight tweets from supporters of neither also expressed negative views toward Anakbayan, the rest of the tweets from that group carried either neutral or positive sentiments.
For this test, we used the Chi-square test for independence with the Bonferroni correction applied (Wijaya, 2020), since we are testing the independence of red-tagging from four different one-hot encoded features: supports_marcos_only, supports_duterte_only, supports_both, and supports_neither.
This means that for a confidence level of 95% (p-value < 0.05), since we are making comparisons across four columns, we divide 0.05 by 4 to arrive at 0.0125 as our adjusted significance level (α).
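The adjustment amounts to comparing each of the four p-values against α/m instead of α; a minimal sketch with hypothetical p-values (not the ones from our actual tests below):

```python
# Bonferroni correction: reject H0 only when p < alpha / m
alpha = 0.05
m = 4  # number of comparisons (one per one-hot support column)
bon_alpha = alpha / m  # adjusted significance level: 0.0125

p_values = [0.001, 0.03, 0.0005, 0.2]  # hypothetical example values
decisions = ["Reject H0" if p < bon_alpha else "Fail to reject H0" for p in p_values]
print(list(zip(p_values, decisions)))
```

Note that 0.03 would pass an unadjusted 0.05 threshold but fails the Bonferroni-adjusted one.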
$H_{A1}$: Fanatics/apologists of Duterte are more likely to red-tag Anakbayan as a terrorist organization.
$H_{A2}$: Fanatics/apologists of Marcos are more likely to red-tag Anakbayan as a terrorist organization.
$H_0$: Both fanatics and non-fanatics are equally likely to participate in the red-tagging of Anakbayan.
from scipy.stats import chi2_contingency
dataset_columns = ['supports_duterte_only', 'supports_marcos_only', 'supports_both', 'supports_neither']
result_columns = []
for col in dataset_columns:
    df_col = tweets[col]
    bon_alpha = 0.05 / len(dataset_columns)
    contingency_table = pd.crosstab(tweets['red_tagging'], df_col)
    chi2, p_value, dof, expected = chi2_contingency(contingency_table, correction=False)
    hypothesis_result = 'Reject null hypothesis' if p_value < bon_alpha else 'Fail to reject null hypothesis'
    result_values = [col, hypothesis_result, chi2, p_value, bon_alpha, dof, expected]
    result_columns.append(result_values)
res_chi_ph = pd.DataFrame(data = result_columns)
res_chi_ph.columns = [
'Political Leaning',
'Hypothesis',
'chi^2 statistic',
'p-value',
'Bonferroni α',
'Degrees of freedom',
'Expected frequencies'
]
res_chi_ph
| Political Leaning | Hypothesis | chi^2 statistic | p-value | Bonferroni α | Degrees of freedom | Expected frequencies | |
|---|---|---|---|---|---|---|---|
| 0 | supports_duterte_only | Reject null hypothesis | 22.659960 | 1.933553e-06 | 0.0125 | 1 | [[108.51587301587301, 12.484126984126984], [11... |
| 1 | supports_marcos_only | Fail to reject null hypothesis | 3.274447 | 7.036666e-02 | 0.0125 | 1 | [[116.1984126984127, 4.801587301587301], [125.... |
| 2 | supports_both | Reject null hypothesis | 103.265098 | 2.931729e-24 | 0.0125 | 1 | [[72.50396825396825, 48.49603174603175], [78.4... |
| 3 | supports_neither | Reject null hypothesis | 185.351602 | 3.288824e-42 | 0.0125 | 1 | [[65.78174603174604, 55.21825396825397], [71.2... |
The results above show that we can reject the null hypothesis for those who support Duterte only, those who support both Marcos and Duterte, and those who support neither. In other words, red-tagging is significantly associated with political stance: Duterte-only and Marcos-Duterte supporters are significantly more likely to red-tag Anakbayan, while supporters of neither are significantly less likely to do so.
As for those who support Marcos only, we were not able to obtain a statistically significant result, likely because the dataset contains only 10 samples from Marcos-only supporters.
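With cells this small, the chi-square approximation itself becomes unreliable. One option (not applied above, shown here only as a sketch) is Fisher's exact test, which computes the exact p-value of a 2×2 contingency table and remains valid at any sample size; the table below is hypothetical, not our real counts:

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table: rows = red_tagging (no/yes),
# columns = a one-hot support flag (0/1) -- NOT the real dataset counts
table = [[8, 2],
         [1, 5]]

# Returns the sample odds ratio (a*d)/(b*c) and the exact two-sided p-value
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio: {odds_ratio:.3f}, p-value: {p_value:.4f}")
```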
To choose the number of topics, we looked at the results of topic modelling with 3, 4, and 5 topics to see which would give us the most insight.
The figure above shows the result with 3 topics. With three topics, there is plenty of overlap between keywords, without much to say about each individual topic.
The figure above shows the result with 4 topics. While the topics are more distinguishable, the clusters are not compact; this is especially visible in topic 1, where some points sit closer to other topics than to topic 1.
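Beyond visually comparing the 3-, 4-, and 5-topic plots, one quantitative heuristic (not applied by the researchers, sketched here on a toy corpus) is to compare model perplexity across candidate topic counts, where lower perplexity suggests a better fit:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus; the real notebook fits on the preprocessed final_tweet column
docs = [
    "anakbayan youth member rally",
    "terrorist front communist group",
    "student council youth organization",
    "communist front terrorist leader",
    "anakbayan member student rally",
]
X = CountVectorizer().fit_transform(docs)

# Fit LDA for each candidate topic count and record perplexity (lower is better)
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=42)
    lda.fit(X)
    print(k, round(lda.perplexity(X), 2))
```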
%%capture
# Initialize NLP components
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from textblob import TextBlob
!pip install pyspellchecker
from spellchecker import SpellChecker
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
spell = SpellChecker()
%%capture
!pip install emoji --upgrade
# Topic modeling via LDA
# Source: https://www.kaggle.com/code/infamouscoder/lda-topic-modeling-features
import re
import emoji
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import decomposition
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
# Custom tokenizer
def tokenizer(text):
    text = emoji.replace_emoji(text, replace='')  # remove emojis
    text = re.sub(r"http\S+", "", text)  # remove URLs
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    tokens = [word for word in word_tokenize(text) if len(word) > 3]  # keep only 4+-letter words
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    filtered_tokens = [token for token in lemmatized_tokens if token.lower() not in stop_words]
    # stemmed_tokens = [stemmer.stem(item) for item in filtered_tokens]
    return filtered_tokens
# Generate features
tf_vectorizer = TfidfVectorizer(tokenizer=tokenizer,
max_df=0.75, max_features=10000,
use_idf=True, norm=None, token_pattern=None)
tf_vectors = tf_vectorizer.fit_transform(tweets.final_tweet)
# Fit LDA to extract 5 topics
n_topics = 5
lda = decomposition.LatentDirichletAllocation(n_components=n_topics, max_iter=10,
learning_method='online', learning_offset=50,
n_jobs=1, random_state=42,
topic_word_prior=0.000001)
W = lda.fit_transform(tf_vectors)
H = lda.components_
# Show top 8 relevant words for each of the 5 topics
num_words = 8
vocab = np.array(tf_vectorizer.get_feature_names_out())
top_words = lambda t: [vocab[i] for i in np.argsort(t)[:-num_words-1:-1]]
topic_words = ([top_words(t) for t in H])
topics = [' '.join(t) for t in topic_words]
df_topics = pd.DataFrame(topics, columns=['Keywords'])
df_topics['Topic ID'] = range(1, len(topics) + 1)
df_topics
| Keywords | Topic ID | |
|---|---|---|
| 0 | anakbayan scmp_uplb member mass nusphilippines... | 1 |
| 1 | anakbayan terrorist leader people youth activi... | 2 |
| 2 | anakbayan communist front accord anakbayan_ph ... | 3 |
| 3 | youth anakbayan partylist know group front fil... | 4 |
| 4 | anakbayan kabataan bayan like muna front group... | 5 |
# Assign topic to each tweet
topicid = ["Topic" + str(i+1) for i in range(lda.n_components)]
tweetid = ["Tweet" + str(i+1) for i in range(len(tweets.final_tweet))]
df_topics_lda = pd.DataFrame(np.round(W,2), columns=topicid, index=tweetid)
significanttopic = np.argmax(df_topics_lda.values, axis=1)+1
df_topics_lda['dominant_topic'] = significanttopic
df_topics_lda['breakdown'] = df_topics_lda.apply(lambda row: '\n'.join([f'{col}: {row[col]}'
for col in sorted(df_topics_lda.columns, key=lambda x: row[x], reverse=True)
if row[col] > 0 and col != 'dominant_topic']), axis=1)
df_topics_lda.head(10)
| Topic1 | Topic2 | Topic3 | Topic4 | Topic5 | dominant_topic | breakdown | |
|---|---|---|---|---|---|---|---|
| Tweet1 | 0.01 | 0.01 | 0.01 | 0.98 | 0.01 | 4 | Topic4: 0.98\nTopic1: 0.01\nTopic2: 0.01\nTopi... |
| Tweet2 | 0.00 | 0.00 | 0.00 | 0.98 | 0.00 | 4 | Topic4: 0.98 |
| Tweet3 | 0.00 | 0.00 | 0.00 | 0.99 | 0.00 | 4 | Topic4: 0.99 |
| Tweet4 | 0.00 | 0.00 | 0.00 | 0.99 | 0.00 | 4 | Topic4: 0.99 |
| Tweet5 | 0.99 | 0.00 | 0.00 | 0.00 | 0.00 | 1 | Topic1: 0.99 |
| Tweet6 | 0.00 | 0.00 | 0.98 | 0.00 | 0.00 | 3 | Topic3: 0.98 |
| Tweet7 | 0.00 | 0.00 | 0.00 | 0.98 | 0.00 | 4 | Topic4: 0.98 |
| Tweet8 | 0.00 | 0.00 | 0.00 | 0.00 | 0.99 | 5 | Topic5: 0.99 |
| Tweet9 | 0.00 | 0.00 | 0.79 | 0.00 | 0.20 | 3 | Topic3: 0.79\nTopic5: 0.2 |
| Tweet10 | 0.00 | 0.00 | 0.00 | 0.00 | 0.99 | 5 | Topic5: 0.99 |
# Visualize topics
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
# Apply t-SNE for dimensionality reduction
tsne = TSNE(n_components=2, random_state=42)
tsne_result = tsne.fit_transform(df_topics_lda.iloc[:,:n_topics])
# Apply K-means clustering
kmeans = KMeans(n_clusters=n_topics, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(df_topics_lda.iloc[:,:n_topics])
# Create a new dataframe with t-SNE coordinates and cluster labels
import textwrap
def split_text(text, max_length):
    lines = textwrap.wrap(text, width=max_length, break_long_words=False)
    return "<br>".join(lines)
df_topics_cluster = pd.DataFrame({'X': tsne_result[:, 0],
'Y': tsne_result[:, 1],
'Tweet': tweets['final_tweet'],
'Cluster': df_topics_lda.reset_index()['dominant_topic'].astype(str), # topics via LDA
# 'Cluster': cluster_labels}, # clusters via K-means
'Breakdown': df_topics_lda.reset_index()['breakdown']})
df_topics_cluster['Tweet'] = df_topics_cluster['Tweet'].apply(lambda x: split_text(x, 40))
df_topics_cluster['Breakdown'] = df_topics_cluster['Breakdown'].str.replace('\n','<br>')
df_topics_cluster.head(10)
| X | Y | Tweet | Cluster | Breakdown | |
|---|---|---|---|---|---|
| 0 | -15.042162 | -147.134689 | You killed them, not the government.<br>Don't ... | 4 | Topic4: 0.98<br>Topic1: 0.01<br>Topic2: 0.01<b... |
| 1 | -130.925217 | -44.245590 | Poor youth. They are wasting their<br>bright f... | 4 | Topic4: 0.98 |
| 2 | -260.619843 | -178.951859 | Why do we need to protect our children<br>agai... | 4 | Topic4: 0.99 |
| 3 | -260.619843 | -178.951859 | They are the group that knows nothing<br>but f... | 4 | Topic4: 0.99 |
| 4 | -214.089203 | 161.981995 | That's nice. LFS, Anakbayan, Kabataan,<br>Gabr... | 1 | Topic1: 0.99 |
| 5 | 138.431992 | -439.232208 | There are also woke people from the US.<br>Com... | 3 | Topic3: 0.98 |
| 6 | -130.925217 | -44.245590 | They are saying this for a long time<br>now. E... | 4 | Topic4: 0.98 |
| 7 | 144.970978 | 375.820251 | Hahaha. Like LFS? Anakbayan? These are<br>CPP/... | 5 | Topic5: 0.99 |
| 8 | 317.921783 | -370.787506 | For the supporters of NPA legal fronts<br>like... | 3 | Topic3: 0.79<br>Topic5: 0.2 |
| 9 | 144.970978 | 375.820251 | Miserable? 😂 CPP-NPA linked behind this<br>act... | 5 | Topic5: 0.99 |
# Plot tweets as colored points
df_topics_cluster.sort_values('Cluster', key=lambda x: pd.to_numeric(x, errors='coerce'), inplace=True)
fig = px.scatter(df_topics_cluster, x='X', y='Y', color='Cluster',
title='Topic Clustering using LDA and t-SNE with 5 topics',
hover_name='Tweet',
hover_data={'X':False, 'Y':False, 'Cluster':False, 'Tweet':False, 'Breakdown':True})
for i, keyword in enumerate(df_topics['Keywords']):
    fig.add_annotation(
        x=0,
        y=-0.2*(i/5) - 0.08,
        text="Topic %d: %s" % (i+1, keyword.replace(' ', ', ')),
        showarrow=False,
        xref='paper',
        yref='paper',
        align='left',
        font=dict(color=fig.data[i].marker['color'])
    )
fig.update_layout(height=710,
xaxis_title='', yaxis_title='',
margin=dict(b=200),
width=700,
paper_bgcolor='#2c3e50',
title=dict(font=dict(color='white')),
legend=dict(title="Topic", font=dict(color='white')))
fig.show()
# chart_studio.plotly.iplot(fig, filename = 'modeling-5-topics', auto_open=True)
With five topics, we see an improvement over four: the topics have more distinguishing features, and the points within each cluster sit closer together. We can distinguish the topics as follows:
Topic 1 contains the handles of student organizations such as scmp_uplb and the National Union of Students of the Philippines (nusphilippines), which means that many tweets talk about Anakbayan together with student organizations.
Scattered among the topics is the keyword front, which is used in many tweets that call Anakbayan a front for the CPP-NPA.